CS-466/566: Math for AI

Module 07: Deep Learning Fundamentals-3

Dr. Mahmoud Mahmoud

The University of Alabama

2026-03-26

TABLE OF CONTENTS

1. Neural Networks for Classification/Regression
2. The Softmax Cross Entropy Loss Function
3. Multi-class network: computational graph

Neural Networks Definition (a.k.a. Deep Learning)

A deep learning model is a computational graph that tries to map inputs, each drawn from some dataset with common characteristics, to outputs drawn from a related distribution.

It is a graph with many layers (simply nested functions).

That is why gradient descent is a bit tricky here: derivatives must be passed back through each nested function.

Regression vs Classification Neural Networks

Classification

Use a sigmoid (or another squashing nonlinearity) on a single output so the readout behaves like a probability or score — that setup is classification.

Regression

Use a linear readout from the hidden layer, e.g. \(\sum_i w^2_{i,1}\, H_i\), to predict a real number — that setup is regression.
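The two readouts can be compared in a minimal sketch (the activations `H`, weights `w2`, and all values here are illustrative, not from the slides):

```python
import numpy as np

# Illustrative hidden activations H and second-layer weights w^2_{i,1}
H = np.array([0.2, 0.7, 0.5])
w2 = np.array([1.5, -0.8, 0.4])

linear_readout = w2 @ H                                   # regression: any real number
sigmoid_readout = 1.0 / (1.0 + np.exp(-linear_readout))   # classification: squashed to (0, 1)
print(linear_readout, sigmoid_readout)
```

The same weighted sum feeds both; only the final nonlinearity (or its absence) changes the task.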

Binary Classification

Binary Classification Neural Network

(One vs All) Multi-class Classification

Given a classification problem with N possible classes, a one-vs.-all solution consists of N separate binary classifiers, one per class.

Note. The target vector has to be encoded using one-hot encoding.
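One-hot encoding can be sketched in a few lines (a minimal illustration; the helper name `one_hot` is my own):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot row vectors."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Three samples with classes 0, 2, 1 out of N = 3 classes
Y = one_hot([0, 2, 1], num_classes=3)
print(Y)
```

Each row has a single 1 in the column of the true class and 0 elsewhere, which is exactly the target format the loss functions below expect.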

Training a Neural Network

  • L is the loss function.
  • Y is the target value.
  • \(L(Y, O)\) is the loss between the target value and the output.
  • We need to find each \(\frac{\partial L}{\partial w^k_{ij}}\)
  • Calculus chain rule is needed to find the derivatives.
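The chain rule can be checked on a toy one-weight network (an illustrative sketch, assuming a sigmoid output and squared-error loss; all values are made up):

```python
import numpy as np

# Toy one-weight network: O = sigmoid(w * x), L(Y, O) = (Y - O)^2
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
x, w, Y = 2.0, 0.5, 1.0

O = sig(w * x)
# Chain rule: dL/dw = dL/dO * dO/dz * dz/dw
dL_dw = -2.0 * (Y - O) * O * (1.0 - O) * x

# Finite-difference check of the same derivative
eps = 1e-6
numeric = ((Y - sig((w + eps) * x))**2 - (Y - sig((w - eps) * x))**2) / (2 * eps)
print(dL_dw, numeric)
```

The analytic product of the three local derivatives matches the numerical estimate, which is the same bookkeeping backpropagation does for every \(w^k_{ij}\).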


More Accurate Multi-class Classification

Previously, we used sigmoid outputs with mean squared error (MSE) as the loss function

  • Pros
    • Convex in the output (the gradient gets steeper the farther you are from the correct output)
    • Acceptable performance
  • Cons
    • Outputs cannot be interpreted as a probability distribution (sigmoid outputs do not sum to 1 across classes).
    • Does not work well for all problem types.

In practice, we use softmax + cross-entropy loss

Softmax Function

  • Softmax extends the binary logistic regression idea (probabilities sum to 1) to the multi-class setting.
  • It helps training converge more quickly than it otherwise would.

Let \(z\) denote the logits (the raw, unnormalized scores output by the network before applying softmax).

Softmax Function

How does it work?

Suppose we have an input vector of scores from the previous layer:

\([5, 3, 2]\)

One way to transform these values into a vector of probabilities is to divide each by the sum: \([5/10,\ 3/10,\ 2/10] = [0.5,\ 0.3,\ 0.2]\).

A better way is softmax: exponentiate each value and normalize, \(p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\), which gives approximately \([0.84,\ 0.11,\ 0.04]\).

Intuition

Softmax Intuition
  • Softmax more strongly amplifies the maximum value relative to the other values.
  • Softmax is partway between normalizing the values and actually applying the max function!
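The \([5, 3, 2]\) example can be checked numerically (a minimal sketch; subtracting the max before exponentiating is a standard numerical-stability trick, not something the slides require):

```python
import numpy as np

z = np.array([5.0, 3.0, 2.0])

naive = z / z.sum()                       # plain normalization: [0.5, 0.3, 0.2]
e = np.exp(z - z.max())                   # subtract max for numerical stability
p = e / e.sum()                           # softmax: roughly [0.84, 0.11, 0.04]
print(naive, p)
```

Both outputs sum to 1, but softmax pushes most of the mass onto the largest score, which is the amplification the intuition bullets describe.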

Our Neural Network

Neural Network Architecture

Loss Function Diagram View

  • The incoming vector \(\textbf{x}\) is transformed by the softmax function to produce the vector of probabilities \(\textbf{p}\).
  • The vector of probabilities \(\textbf{p}\) is then compared to the vector of actual values \(\textbf{y}\) using the cross entropy loss function resulting in a scalar loss value.

Cross Entropy

Recall that a classification loss function takes in a vector of predicted probabilities and a vector of actual values.

The cross entropy loss function, for each index i in these vectors, is:

\(\ell_i = -y_i\log(p_i)-(1-y_i)\log(1-p_i)\)

Intuition

  • \(p_i\) is a probability between 0 and 1, so each log term is well defined.
  • Since \(y_i\) is either 0 or 1, the loss simplifies:
    • if \(y_i = 1\) : Loss = \(-\log(p_i)\)
    • if \(y_i = 0\) : Loss = \(-\log(1 - p_i)\)
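A quick numeric check of the two cases (a minimal sketch; the helper name `cross_entropy` is my own):

```python
import numpy as np

def cross_entropy(y, p):
    """Per-index cross-entropy: -y*log(p) - (1-y)*log(1-p)."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(cross_entropy(1.0, 0.9))  # y = 1, confident and correct: small loss (~0.105)
print(cross_entropy(1.0, 0.1))  # y = 1, confident and wrong: large loss (~2.303)
print(cross_entropy(0.0, 0.1))  # y = 0, confident and correct: small loss (~0.105)
```

Confident wrong predictions are punished much more heavily than confident correct ones, which is what makes the loss effective for classification.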

Cross Entropy Loss Intuition

Gradient Computation

The real magic happens when we combine this loss with the softmax function

\(-\ell_1 = y_1 \log \left( p_1 \right) + (1-y_1) \log \left(1 - p_1 \right)\)

\(\Downarrow\)

\(-\ell_1 = y_1 \log \left(\frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\right) + (1-y_1) \log \left(1 - \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\right)\)

Based on this expression, the gradient would seem to be a bit trickier for this loss, yet it simplifies to a remarkably clean form (proof below):

\(\displaystyle \frac{\partial \ell_i}{\partial x_i} = p_i - y_i\)

Proof: \(\frac{\partial \ell_i}{\partial x_i} = p_i - y_i\)

\(-\ell_1 = y_1 \log \left(\frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\right) + (1-y_1) \log \left(1 - \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\right)\)

Proof.

  1. Split logarithm of fraction for first term

    \(-\ell_1 = y_1 \left[ \log(e^{x_1}) - \log(e^{x_1}+e^{x_2}+e^{x_3}) \right] + (1-y_1) \log \left[1 - \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\right]\)

  2. Simplify \(\log(e^{x_1}) = x_1\) and rewrite the complement fraction

    \(-\ell_1 = y_1 x_1 - y_1 \log(e^{x_1}+e^{x_2}+e^{x_3}) + (1-y_1) \log \left[\frac{e^{x_2}+e^{x_3}}{e^{x_1}+e^{x_2}+e^{x_3}}\right]\)

  3. Split logarithm of fraction in last term

    \(-\ell_1 = y_1 x_1 - y_1 \log(e^{x_1}+e^{x_2}+e^{x_3}) + (1-y_1) \log(e^{x_2}+e^{x_3}) - (1-y_1) \log(e^{x_1}+e^{x_2}+e^{x_3})\)

  4. Combine like terms with \(\log(e^{x_1}+e^{x_2}+e^{x_3})\)

    \(-\ell_1 = y_1 x_1 + (1-y_1) \log(e^{x_2}+e^{x_3}) - \log(e^{x_1}+e^{x_2}+e^{x_3})\)

  5. Take derivative with respect to \(x_1\)

    \(-\frac{\partial \ell_1}{\partial x_1} = y_1 - \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\)

  6. Recognize softmax probability definition \(p_1 = \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\)

    \(\frac{\partial \ell_1}{\partial x_1} = p_1 - y_1\)   (multiplying step 5 through by \(-1\); \(\ell_1\) is the cross-entropy loss)
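The result \(\partial \ell_1 / \partial x_1 = p_1 - y_1\) can be verified against a finite-difference approximation (an illustrative sketch with made-up inputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ell1(x, y1):
    """Cross-entropy loss for the first component, as in the proof."""
    p1 = softmax(x)[0]
    return -(y1 * np.log(p1) + (1 - y1) * np.log(1 - p1))

x = np.array([5.0, 3.0, 2.0])
y1 = 1.0

analytic = softmax(x)[0] - y1                              # p_1 - y_1
e1 = np.array([1e-6, 0.0, 0.0])
numeric = (ell1(x + e1, y1) - ell1(x - e1, y1)) / 2e-6     # finite difference in x_1
print(analytic, numeric)
```

The analytic and numerical derivatives agree, confirming that the messy-looking expression really does collapse to \(p_1 - y_1\).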


Neural Network As Computational Graph

Two-layer network for multi-class classification (softmax output).

  • Architecture: one hidden layer (here with sigmoid activations) and one output layer whose activations are logits \(N\); applying softmax yields class probabilities \(P\) (e.g. dog / cat / horse, or 3-way softmax in the diagram).
  • Task: predict a probability vector over classes — this is multi-class classification, not binary.

Neural Network As Computational Graph

Neural Network Computational Graph

  • \(\textbf{X}\) is a 4x3 matrix (4 samples, 3 features)
  • \(\textbf{W}^{(1)}\) is a 3x3 matrix
  • \(\textbf{B}^{(1)}\) is a 1x3 vector
  • \(\textbf{W}^{(2)}\) is a 3x3 matrix
  • \(\textbf{B}^{(2)}\) is a 1x3 vector

Computational Graph: \(\partial L/\partial W^{(2)}_{3\times 3}\)

Compute the gradient of \(L\) (cross-entropy loss) with respect to \(W^{(2)}_{3,3}\) (backward path: \(L \leftarrow R \leftarrow N \leftarrow W^{(2)}\), where \(N = M \cdot W^{(2)}\) and \(R = N + B^{(2)}\) are the logits before softmax).

  • \(\frac{\partial L}{\partial W^{(2)}_{3,3}} = \frac{\partial L}{\partial R_{4,3}} \odot \frac{\partial R_{4,3}}{\partial N_{4,3}} \odot \frac{\partial N_{4,3}}{\partial W^{(2)}_{3,3}}\)   (softmax + CE upstream gradient \(\partial L/\partial R = P-Y\))
  • \(\frac{\partial L}{\partial R_{4,3}} = P_{4,3} - Y_{4,3}\)
  • \(\frac{\partial R_{4,3}}{\partial N_{4,3}} = {\textbf{1}}_{4,3}\)
  • \(\frac{\partial N_{4,3}}{\partial W^{(2)}_{3,3}} = (M^\top)_{3,4} \cdot {\textbf{1}}_{4,3}\)
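This gradient can be sketched with NumPy (illustrative: I use stand-in hidden activations \(M\) and the categorical cross-entropy \(L = -\sum Y \log P\), whose gradient with respect to the logits is also \(P - Y\) for one-hot targets):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(size=(4, 3))           # stand-in hidden activations (4 samples x 3 units)
W2 = rng.normal(size=(3, 3))
B2 = np.zeros((1, 3))
Y = np.eye(3)[[0, 1, 2, 0]]            # one-hot targets

R = M @ W2 + B2                        # N = M . W2, R = N + B2 (logits)
P = np.exp(R - R.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)      # row-wise softmax

dR = P - Y                             # upstream gradient dL/dR
dW2 = M.T @ dR                         # transpose(M) . (P - Y), shape 3x3
print(dW2.shape)
```

Note how the whole chain collapses to one matrix product: the \(4\times 3\) upstream gradient times the transposed activations yields a gradient shaped exactly like \(W^{(2)}\).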

Computational Graph: \(\partial L/\partial W^{(1)}_{3\times 3}\)

Compute the gradient of \(L\) with respect to \(W^{(1)}_{3,3}\) (full backward path through the hidden sigmoid: \(L \leftarrow R \leftarrow N \leftarrow M \leftarrow V \leftarrow U \leftarrow W^{(1)}\)).

  • \(\frac{\partial L}{\partial W^{(1)}_{3,3}} = \frac{\partial L}{\partial R_{4,3}} \odot \frac{\partial R_{4,3}}{\partial N_{4,3}} \odot \frac{\partial N_{4,3}}{\partial M_{4,3}} \odot \frac{\partial M_{4,3}}{\partial V_{4,3}} \odot \frac{\partial V_{4,3}}{\partial U_{4,3}} \odot \frac{\partial U_{4,3}}{\partial W^{(1)}_{3,3}}\)
  • \(\frac{\partial L}{\partial R_{4,3}} = P_{4,3} - Y_{4,3}\)
  • \(\frac{\partial R_{4,3}}{\partial N_{4,3}} = {\textbf{1}}_{4,3}\)
  • \(\frac{\partial N_{4,3}}{\partial M_{4,3}} = {\textbf{1}}_{4,3} \cdot (W^{(2)}_{3,3})^\top\)
  • \(\frac{\partial M_{4,3}}{\partial V_{4,3}} = sigmoid(V_{4,3}) \odot ({\textbf{1}}_{4,3} - sigmoid(V_{4,3}))\)
  • \(\frac{\partial V_{4,3}}{\partial U_{4,3}} = {\textbf{1}}_{4,3}\)
  • \(\frac{\partial U_{4,3}}{\partial W^{(1)}_{3,3}} = (X^\top)_{3,4} \cdot \textbf{1}_{4,3}\)
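The full backward pass for \(W^{(1)}\) can be sketched end to end (illustrative: random data, variable names following the graph, and the categorical cross-entropy \(L = -\sum Y \log P\) as the loss):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 features
W1 = rng.normal(size=(3, 3)); B1 = np.zeros((1, 3))
W2 = rng.normal(size=(3, 3)); B2 = np.zeros((1, 3))
Y = np.eye(3)[[0, 1, 2, 0]]                    # one-hot targets

# Forward pass; names follow the graph: U, V, M, N, R, P
U = X @ W1
V = U + B1
M = 1.0 / (1.0 + np.exp(-V))                   # sigmoid hidden layer
N = M @ W2
R = N + B2
P = np.exp(R - R.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Backward pass, mirroring the chain in the bullets above
dR = P - Y                                     # dL/dR
dM = dR @ W2.T                                 # through N = M . W2
dV = dM * M * (1.0 - M)                        # through the sigmoid
dW1 = X.T @ dV                                 # transpose(X) . dV, shape 3x3
print(dW1.shape)
```

Each line of the backward pass corresponds to one local-derivative bullet above, applied right to left.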

Computational Graph: \(\partial L/\partial B^{(2)}_{1\times 3}\)

Compute the gradient of \(L\) with respect to \(B^{(2)}_{1,3}\) (bias added into logits \(R\); broadcast \(4\times 3\)).

  • \(\frac{\partial L}{\partial B^{(2)}_{1,3}} = \frac{\partial L}{\partial R_{4,3}} \odot \frac{\partial R_{4,3}}{\partial B^{(2)}_{1,3}}\)
  • \(\frac{\partial L}{\partial R_{4,3}} = P_{4,3} - Y_{4,3}\)
  • \(\frac{\partial R_{4,3}}{\partial B^{(2)}_{1,3}} = {\textbf{1}}_{4,3}\)

Note. \(\odot\) gives \(4\times 3\) (one row per sample). \(B^{(2)}_{1,3}\) is shared across rows — sum across rows (column-wise over the batch) for \(\partial L/\partial B^{(2)}_{1,3}\).

Computational Graph: \(\partial L/\partial B^{(1)}_{1\times 3}\)

Compute the gradient of \(L\) with respect to \(B^{(1)}_{1,3}\) (bias in \(V\); broadcast \(4\times 3\)).

  • \(\frac{\partial L}{\partial B^{(1)}_{1,3}} = \frac{\partial L}{\partial R_{4,3}} \odot \frac{\partial R_{4,3}}{\partial N_{4,3}} \odot \frac{\partial N_{4,3}}{\partial M_{4,3}} \odot \frac{\partial M_{4,3}}{\partial V_{4,3}} \odot \frac{\partial V_{4,3}}{\partial B^{(1)}_{1,3}}\)
  • \(\frac{\partial L}{\partial R_{4,3}} = P_{4,3} - Y_{4,3}\)
  • \(\frac{\partial R_{4,3}}{\partial N_{4,3}} = {\textbf{1}}_{4,3}\)
  • \(\frac{\partial N_{4,3}}{\partial M_{4,3}} = {\textbf{1}}_{4,3} \cdot (W^{(2)}_{3,3})^\top\)
  • \(\frac{\partial M_{4,3}}{\partial V_{4,3}} = sigmoid(V_{4,3}) \odot ({\textbf{1}}_{4,3} - sigmoid(V_{4,3}))\)
  • \(\frac{\partial V_{4,3}}{\partial B^{(1)}_{1,3}} = {\textbf{1}}_{4,3}\)

Note. \(\odot\) gives \(4\times 3\) (one row per sample). \(B^{(1)}_{1,3}\) is shared across rows — sum across rows (column-wise over the batch) for \(\partial L/\partial B^{(1)}_{1,3}\).
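Both bias gradients with the row-sum reduction can be sketched together (illustrative: random data, and the categorical cross-entropy \(L = -\sum Y \log P\) as the loss):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 3)); B1 = rng.normal(size=(1, 3))
W2 = rng.normal(size=(3, 3)); B2 = rng.normal(size=(1, 3))
Y = np.eye(3)[[0, 1, 2, 0]]

V = X @ W1 + B1                                # biases broadcast across the 4 rows
M = 1.0 / (1.0 + np.exp(-V))
R = M @ W2 + B2
P = np.exp(R - R.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

dR = P - Y                                     # 4x3, one row per sample
dB2 = dR.sum(axis=0, keepdims=True)            # sum over rows -> 1x3, matching B2
dV = (dR @ W2.T) * M * (1.0 - M)
dB1 = dV.sum(axis=0, keepdims=True)            # sum over rows -> 1x3, matching B1
print(dB1.shape, dB2.shape)
```

Because each bias is shared by every sample in the batch, its gradient is the column-wise sum of the per-sample \(4\times 3\) gradient, exactly as the notes above state.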

Thank You!
